JIT: Add support for byte/sbyte SIMD multiplication on XArch #86811

BladeWise · 2023-05-26T20:20:41Z

The PR adds support for SIMD-based byte/sbyte multiplication on XArch.

Given that there is no specific hardware intrinsic, the hardware acceleration is achieved widening/narrowing vectors, depending on the supported ISA.

The basic algorithm is the following:

In case of a Vector256<byte>/Vector256<sbyte> input and if Avx512BW is supported, the vectors are widened as Vector512<ushort>/Vector512<short>, multiplied and then narrowed back.
In case of a Vector128<byte>/Vector128<sbyte> input and if Avx2 is supported, the vectors are widended as Vector256<ushort>/Vector256<short>, multiplied and then narrowed back.
In any other case, the vectors are splitted in halves (lower/upper) and widened as vectors of the same size of the input (e.g., one Vector128<byte> produces two Vector128<ushort>, each having half of the original vector), each half is multiplied with his counterpart and results are narrowed into a single vector.

SPMI reports some regressions, but as far as I could see the previous implementation (software fallback) was not inlined, while the current one is.

The following code

public static Vector128<byte> MultiplyB128(Vector128<byte> v1, Vector128<byte> v2) => v1 * v2;

Produces this codegen

G_M65351_IG01:  ;; offset=0000H
       vzeroupper 
						;; size=3 bbWeight=1 PerfScore 1.00
G_M65351_IG02:  ;; offset=0003H
       vpmovzxbw ymm0, qword ptr [rdx]
       vpmovzxbw ymm1, qword ptr [r8]
       vpmullw  ymm0, ymm0, ymm1
       vpand    ymm0, ymm0, ymmword ptr [reloc @RWD00]
       vpackuswb ymm0, ymm0, ymm0
       vpermq   ymm0, ymm0, -40
       vmovups  xmmword ptr [rcx], xmm0
       mov      rax, rcx
						;; size=39 bbWeight=1 PerfScore 21.25
G_M65351_IG03:  ;; offset=002AH
       vzeroupper 
       ret      
						;; size=4 bbWeight=1 PerfScore 2.00
RWD00  	dq	00FF00FF00FF00FFh, 00FF00FF00FF00FFh, 00FF00FF00FF00FFh, 00FF00FF00FF00FFh

Implement SIMD-based byte/sbyte multiplication for Vector128/256/512 through widening/narrowing

ghost · 2023-05-26T20:20:53Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

The PR adds support for SIMD-based byte/sbyte multiplication on XArch.

Given that there is no specific hardware intrinsic, the hardware acceleration is achieved widening/narrowing vectors, depending on the supported ISA.

The basic algorithm is the following:

In case of a Vector256<byte>/Vector256<sbyte> input and if Avx512BW is supported, the vectors are widened as Vector512<ushort>/Vector512<short>, multiplied and then narrowed back.
In case of a Vector128<byte>/Vector128<sbyte> input and if Avx2 is supported, the vectors are widended as Vector256<ushort>/Vector256<short>, multiplied and then narrowed back.
In any other case, the vectors are splitted in halves (lower/upper) and widened as vectors of the same size of the input (e.g., one Vector128<byte> produces two Vector128<ushort>, each having half of the original vector), each half is multiplied with his counterpart and results are narrowed into a single vector.

SPMI reports some regressions, but as far as I could see the previous implementation (software fallback) was not inlined, while the current one is.

The following code

public static Vector128<byte> MultiplyB128(Vector128<byte> v1, Vector128<byte> v2) => v1 * v2;

Produces this codegen

G_M65351_IG01:  ;; offset=0000H
       vzeroupper 
						;; size=3 bbWeight=1 PerfScore 1.00
G_M65351_IG02:  ;; offset=0003H
       vpmovzxbw ymm0, qword ptr [rdx]
       vpmovzxbw ymm1, qword ptr [r8]
       vpmullw  ymm0, ymm0, ymm1
       vpand    ymm0, ymm0, ymmword ptr [reloc @RWD00]
       vpackuswb ymm0, ymm0, ymm0
       vpermq   ymm0, ymm0, -40
       vmovups  xmmword ptr [rcx], xmm0
       mov      rax, rcx
						;; size=39 bbWeight=1 PerfScore 21.25
G_M65351_IG03:  ;; offset=002AH
       vzeroupper 
       ret      
						;; size=4 bbWeight=1 PerfScore 2.00
RWD00  	dq	00FF00FF00FF00FFh, 00FF00FF00FF00FFh, 00FF00FF00FF00FFh, 00FF00FF00FF00FFh

Author:	BladeWise
Assignees:	-
Labels:	`area-CodeGen-coreclr`, `community-contribution`
Milestone:	-

src/coreclr/jit/gentree.cpp

Checked codegen only, no AVX512 tests run (hardware unavailable)

BladeWise · 2023-05-29T22:18:02Z

I have updated the PR to fix the issue with AVX512.
Source code got longer to address SIMD16 handling with and without AVX512BW_VL support.

kunalspathak · 2023-06-05T17:00:06Z

@tannergooding

JulieLeeMSFT · 2023-06-19T16:42:42Z

@tannergooding, please review this community contribution and consider if this can be done in C# instead of in JIT.

BruceForstall

@tannergooding Could this code instead be implemented in C# in (for example) the Vector128<T>::operator* fallback implementation, using all the same hardware intrinsics, ISA checking, etc.? Or is there some reason why that either can't or shouldn't be done?

BruceForstall · 2023-07-07T03:59:37Z

src/coreclr/jit/gentree.cpp

+                case TYP_BYTE:
+                case TYP_UBYTE:


It looks like there is some precedent for inserting additional IR nodes in this function, but this is quite a lot more than the other cases, and these new cases return without falling through like the others. Is this the right place for this code? Or should this be a separate function that is called before getting here?

tannergooding · 2023-07-07T16:45:00Z

Could this code instead be implemented in C# in (for example) the Vector128::operator* fallback implementation, using all the same hardware intrinsics, ISA checking, etc.? Or is there some reason why that either can't or shouldn't be done?

It "could" be, but there are certain pessimizations that I believe make doing that the wrong choice for these APIs. Most notably because these APIs are considered "perf critical" and one of the primary ways users interact with the SIMD types.

One of the pessimizations is that the IR produced by inlining still differs and is "sub-optimal" in comparison. It is often inserting additional copies, locals, temporaries, or other handling at the "boundary" just before/after the inlined code. A lot of this has been improved with the introduction of forward sub, but there are still notable edge cases. Likewise, it is still dependent on the inliner and while we've made some improvements to that (particularly in adjusting the budget when many intrinsic calls exist), there are still plenty of cases (especially in more complex methods) where the budget is quickly eaten by APIs like this being in managed and it will simply give up. That is often less problematic for other APIs, but represents a particular concern when you're considering APIs that are exposing primitive functionality used to write all your other algorithms.

We ultimately are generating roughly 6 instructions of very specialized code for what is a baseline arithmetic operation on the SIMD types. If we were designing IL "today", we would potentially have SIMD type support built into the IL and so this would be the same handling we'd be doing for CEE_MUL in that case. Because its a baseline arithmetic operation, we want such cases to participate in VN, CSE, potentially in constant folding, to be properly costed, etc.

Notably, it would probably be better for this logic to be in lowering or morph instead and for us to just track it as Vector128_op_Multiply in the front end. That's how we handle other primitive operations that tend to be slightly more expensive on some platforms as well (such as conversions, checked arithmetic, long decomposition, etc). But, as is, its consistent with the existing handling for other similar scenarios..

Longer term I would definitely like to see the issues called out above addressed, even if that's done via special casing these APIs in the inliner/etc instead. Once we're confident that's the case, then we should look at incrementally moving all similar functionality into managed instead. We might still end up with some bits staying as handled by the JIT (such as for truly simple operations where equivalent handling exists for int, long, float, or double), but it would allow all of simdashwintrinsic to be removed (they are really user defined types that are getting special optimizations today and are not ABI primitives) and many of the more complex helpers (like Dot or Sum) to have their complexity removed from the JIT.

…ultiply

JulieLeeMSFT · 2023-07-31T16:37:22Z

Since it is past .NET 8 code complete, setting this to .NET 9.

…ultiply

BladeWise · 2023-09-30T15:52:22Z

@BruceForstall I have merged the branch and fixed the break with current sources.
I will adress the other concerns (more comments with examples and specific variables for new nodes) as soon as tests are completed.

DeepakRajendrakumaran · 2023-10-02T16:35:11Z

@DeepakRajendrakumaran @anthonycanino @tannergooding @dotnet/avx512-contrib Any additional comments/suggestions on this code?

Sorry for the late reply. Changes look good at first glance. Couldn't find any similar implementation in native compilers to compare disasm though

src/coreclr/jit/hwintrinsicxarch.cpp

…ultiply

BladeWise · 2023-10-06T20:13:42Z

@BruceForstall as per your request:

I have re-merged with latest main and addressed other concerns.
More comments have been added to explain what each branch wants to do, and samples of the equivalent C# code for relevant statements have been provided. I did not provide an example of the generated ASM because it could be too verbose in some cases. Let me know if you prefer I add it anyway in sources or as a comment to this PR.
All new nodes now use new variables, named to explicitly indicate the contents.

BladeWise · 2023-10-07T10:56:57Z

Test failures seem unrelated to the PR (Build Analysis failure logs report 'No space left on device').
Is it possible to re-run failed checks?

BruceForstall · 2023-10-09T05:10:06Z

/azp run runtime-coreclr superpmi-diffs

azure-pipelines · 2023-10-09T05:10:23Z

Azure Pipelines successfully started running 1 pipeline(s).

BruceForstall · 2023-10-09T05:15:00Z

/azp run runtime-coreclr jitstress-isas-avx512, runtime-coreclr jitstress-isas-x86

azure-pipelines · 2023-10-09T05:15:16Z

Azure Pipelines successfully started running 2 pipeline(s).

BladeWise · 2023-10-09T18:12:48Z

Failing checks are related to #93163.

BruceForstall

@BladeWise Thank you for your work on this, and your patience for how long it took to get it in.

And thank you for your contribution to .NET!

BladeWise · 2023-10-09T19:02:56Z

@BruceForstall thanks to you for reviewing and support!

Add support for byte/sbyte SIMD multiply on XArch

5c6fdc1

Implement SIMD-based byte/sbyte multiplication for Vector128/256/512 through widening/narrowing

ghost added the community-contribution Indicates that the PR has been added by a community member label May 26, 2023

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 26, 2023

SingleAccretion reviewed May 26, 2023

View reviewed changes

src/coreclr/jit/gentree.cpp Outdated Show resolved Hide resolved

BladeWise added 6 commits May 26, 2023 22:39

Fix formatting

b065946

Replace gtCloneExpr with fgMakeMultiUse (review)

6846733

Merge branch 'main' into feature/simd-byte-multiply

e6193d8

Merge branch 'main' into feature/simd-byte-multiply

1fa1b6c

Attempt to fix codepath for AVX512

516b94f

Checked codegen only, no AVX512 tests run (hardware unavailable)

Merge branch 'main' into feature/simd-byte-multiply

c621093

build-analysis bot mentioned this pull request May 29, 2023

AppHost.Bundle.Tests.BundleLocalizedApp.Bundled_Localized_App_Run_Succeeds failing #85993

Closed

BladeWise added 3 commits May 30, 2023 19:20

Merge branch 'main' into feature/simd-byte-multiply

c5e5911

Merge branch 'main' into feature/simd-byte-multiply

a5162a7

Merge branch 'main' into feature/simd-byte-multiply

3dfc0b9

EgorBo mentioned this pull request Jun 4, 2023

Performance regression of XXHash128 for large buffers with .NET 8 #87107

Open

BladeWise added 2 commits June 5, 2023 21:34

Merge branch 'main' into feature/simd-byte-multiply

339cb2f

Merge branch 'main' into feature/simd-byte-multiply

c2b5e2e

JulieLeeMSFT assigned BladeWise and tannergooding Jun 19, 2023

JulieLeeMSFT requested a review from tannergooding June 19, 2023 16:41

JulieLeeMSFT requested a review from BruceForstall June 26, 2023 16:32

BruceForstall reviewed Jul 7, 2023

View reviewed changes

Merge remote-tracking branch 'upstream/main' into feature/simd-byte-m…

9471e8d

…ultiply

JulieLeeMSFT added this to the 9.0.0 milestone Jul 31, 2023

BladeWise added 2 commits September 30, 2023 17:34

Merge remote-tracking branch 'upstream/main' into feature/simd-byte-m…

56680db

…ultiply

Fix renamed method to check for SIMD 512 support

3ac62c6

ghost removed the needs-author-action An issue or pull request that requires more info or actions from the author. label Sep 30, 2023

build-analysis bot mentioned this pull request Sep 30, 2023

Test_EventSource_EtwManifestGeneration* tests failing in CI #48798

Closed

DeepakRajendrakumaran reviewed Oct 2, 2023

View reviewed changes

src/coreclr/jit/hwintrinsicxarch.cpp Show resolved Hide resolved

BladeWise added 2 commits October 4, 2023 20:19

Avoid to re-use variables for newly created nodes

024a87c

Merge remote-tracking branch 'upstream/main' into feature/simd-byte-m…

81b1e07

…ultiply

build-analysis bot mentioned this pull request Oct 5, 2023

tracing/runtimeeventsource/nativeruntimeeventsource/nativeruntimeeventsource.sh Timed Out on Mono OSX x64 with interpreter #92733

Open

BruceForstall added the needs-author-action An issue or pull request that requires more info or actions from the author. label Oct 5, 2023

BladeWise added 3 commits October 6, 2023 18:52

Merge remote-tracking branch 'upstream/main' into feature/simd-byte-m…

22912c1

…ultiply

Add detailed comments

b1a9686

Merge remote-tracking branch 'upstream/main' into feature/simd-byte-m…

74bbff9

…ultiply

ghost removed the needs-author-action An issue or pull request that requires more info or actions from the author. label Oct 6, 2023

Fix formatting

cdde321

This was referenced Oct 7, 2023

Build crashes in System.Runtime.Serialization.SerializationGuard #92007

Closed

MSBuild crashing in the build #92290

Open

BruceForstall approved these changes Oct 9, 2023

View reviewed changes

BruceForstall merged commit 9330273 into dotnet:main Oct 9, 2023
147 of 153 checks passed

BladeWise deleted the feature/simd-byte-multiply branch October 9, 2023 19:03

ghost locked as resolved and limited conversation to collaborators Nov 9, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

JIT: Add support for byte/sbyte SIMD multiplication on XArch #86811

JIT: Add support for byte/sbyte SIMD multiplication on XArch #86811

BladeWise commented May 26, 2023

ghost commented May 26, 2023

BladeWise commented May 29, 2023

kunalspathak commented Jun 5, 2023

JulieLeeMSFT commented Jun 19, 2023

BruceForstall left a comment •

edited

Loading

BruceForstall Jul 7, 2023

tannergooding commented Jul 7, 2023 •

edited

Loading

JulieLeeMSFT commented Jul 31, 2023

BladeWise commented Sep 30, 2023

DeepakRajendrakumaran commented Oct 2, 2023

BladeWise commented Oct 6, 2023

BladeWise commented Oct 7, 2023 •

edited

Loading

BruceForstall commented Oct 9, 2023

azure-pipelines bot commented Oct 9, 2023

BruceForstall commented Oct 9, 2023

azure-pipelines bot commented Oct 9, 2023

BladeWise commented Oct 9, 2023

BruceForstall left a comment

BladeWise commented Oct 9, 2023

JIT: Add support for byte/sbyte SIMD multiplication on XArch #86811

JIT: Add support for byte/sbyte SIMD multiplication on XArch #86811

Conversation

BladeWise commented May 26, 2023

ghost commented May 26, 2023

BladeWise commented May 29, 2023

kunalspathak commented Jun 5, 2023

JulieLeeMSFT commented Jun 19, 2023

BruceForstall left a comment • edited Loading

Choose a reason for hiding this comment

BruceForstall Jul 7, 2023

Choose a reason for hiding this comment

tannergooding commented Jul 7, 2023 • edited Loading

JulieLeeMSFT commented Jul 31, 2023

BladeWise commented Sep 30, 2023

DeepakRajendrakumaran commented Oct 2, 2023

BladeWise commented Oct 6, 2023

BladeWise commented Oct 7, 2023 • edited Loading

BruceForstall commented Oct 9, 2023

azure-pipelines bot commented Oct 9, 2023

BruceForstall commented Oct 9, 2023

azure-pipelines bot commented Oct 9, 2023

BladeWise commented Oct 9, 2023

BruceForstall left a comment

Choose a reason for hiding this comment

BladeWise commented Oct 9, 2023

BruceForstall left a comment •

edited

Loading

tannergooding commented Jul 7, 2023 •

edited

Loading

BladeWise commented Oct 7, 2023 •

edited

Loading